knitr::opts_chunk$set(echo = TRUE)
# install.packages(c("dplyr","ade4","magrittr","cluster","factoextra","cluster.datasets","xtable","kableExtra","knitr","summarytools"))
A distance function or a metric on \(\mathbb{R}^m,\:m\geq 1\), is a function \(d:\mathbb{R}^m\times\mathbb{R}^m\rightarrow \mathbb{R}\).
A distance function must satisfy some required properties or axioms.
There are three main axioms.
A1. \(d(\mathbf{x},\mathbf{y})= 0\iff \mathbf{x}=\mathbf{y}\) (identity of indiscernibles);
A2. \(d(\mathbf{x},\mathbf{y})= d(\mathbf{y},\mathbf{x})\) (symmetry);
A3. \(d(\mathbf{x},\mathbf{z})\leq d(\mathbf{x},\mathbf{y})+d(\mathbf{y},\mathbf{z})\) (triangle inequality), where \(\mathbf{x}=(x_1,\cdots,x_m)\), \(\mathbf{y}=(y_1,\cdots,y_m)\) and \(\mathbf{z}=(z_1,\cdots,z_m)\) are all vectors of \(\mathbb{R}^m\).
Strictly speaking, the term dissimilarity should be used rather than distance when not all three axioms A1-A3 hold.
Most of the time, with some abuse of vocabulary, we shall nevertheless use the term distance.
The Euclidean distance is defined by: \[ d(\mathbf{x},\mathbf{y})=\sqrt{\sum_{j=1}^m (x_j-y_j)^2}. \]
The Manhattan distance is defined by: \[d(\mathbf{x},\mathbf{y}) =\sum_{j=1}^m |x_j-y_j|.\]
x = c(0, 0)
y = c(6,6)
dist(rbind(x, y), method = "euclidean")
## x
## y 8.485281
dist(rbind(x, y), method = "euclidean",diag=T,upper=T)
## x y
## x 0.000000 8.485281
## y 8.485281 0.000000
6*sqrt(2)
## [1] 8.485281
dist(rbind(x, y), method = "manhattan")
## x
## y 12
dist(rbind(x, y), method = "manhattan",diag=T,upper=T)
## x y
## x 0 12
## y 12 0
The Canberra distance is defined by: \[d(\mathbf{x},\mathbf{y}) =\sum_{j=1}^m \frac{|x_j-y_j|}{|x_j|+|y_j|}.\]
x = c(0, 0)
y = c(6,6)
dist(rbind(x, y), method = "canberra")
## x
## y 2
6/6+6/6
## [1] 2
The Minkowski distance of order \(p\geq 1\) is defined by: \[ d(\mathbf{x},\mathbf{y})=\|\mathbf{x}-\mathbf{y}\|_p=\left[\sum_{j=1}^m |x_j-y_j|^{p}\right]^{1/p}. \]
The corresponding norm satisfies \[\|\mathbf{x}\|_p=d(\mathbf{x},\mathbf{0}),\]
where \(\mathbf{0}\) is the null vector of \(\mathbb{R}^m\).
library("ggplot2")
x = c(0, 0)
y = c(6,6)
MinkowDist=c() # initialize the vector of distances as empty
for (p in seq(1,30,.01))
{
MinkowDist=c(MinkowDist,dist(rbind(x, y), method = "minkowski", p = p))
}
ggplot(data =data.frame(x = seq(1,30,.01), y=MinkowDist ) , mapping = aes( x=x, y= y))+
geom_point(size=.1,color="red")+
xlab("p")+ylab("Minkowski Distance")+ggtitle("Minkowski distance wrt p")
Produce a similar graph using “The Economist” theme. Indicate on the graph the Manhattan and Euclidean distances, as well as the Chebyshev distance introduced below.
At the limit, we get the Chebyshev distance, which is defined by: \[ d(\mathbf{x},\mathbf{y})=\max_{j=1,\cdots,m}|x_j-y_j|=\lim_{p\rightarrow\infty} \left[\sum_{j=1}^m |x_j-y_j|^{p}\right]^{1/p}. \]
The corresponding norm is:
\[ \|\mathbf{x}\|_\infty=\max_{j=1,\cdots,m}|x_j|. \]
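In R, the Chebyshev distance is available through `dist()` under the name `"maximum"`. A minimal check against the direct definition, reusing the two points from above:

```r
x = c(0, 0)
y = c(6,6)
dist(rbind(x, y), method = "maximum") # Chebyshev distance
## x
## y 6
max(abs(x - y)) # direct definition
## [1] 6
```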
The proof of the triangle inequality A3 is based on the Minkowski inequality:
For any nonnegative real numbers \(a_1,\cdots,a_m\); \(b_1,\cdots,b_m\), and for any \(p\geq 1\), we have: \[ \left[\sum_{j=1}^m (a_j+b_j)^{p}\right]^{1/p}\leq \left[\sum_{j=1}^m a_j^{p}\right]^{1/p} +\left[\sum_{j=1}^m b_j^{p}\right]^{1/p}. \]
To prove that the Minkowski distance satisfies A3, notice that \[ \sum_{j=1}^m|x_j-z_j|^{p}= \sum_{j=1}^m|(x_j-y_j)+(y_j-z_j)|^{p}. \]
Since for any reals \(x,y\), we have: \(|x+y|\leq |x|+|y|\), and using the fact that \(x^p\) is increasing in \(x\geq 0\), we obtain: \[ \sum_{j=1}^m|x_j-z_j|^{p}\leq \sum_{j=1}^m(|x_j-y_j|+|y_j-z_j|)^{p}. \]
Applying the Minkowski inequality with \(a_j=|x_j-y_j|\) and \(b_j=|y_j-z_j|\), \(j=1,\cdots,m\), we get: \[ \left[\sum_{j=1}^m|x_j-z_j|^{p}\right]^{1/p}\leq \left(\sum_{j=1}^m |x_j-y_j|^{p}\right)^{1/p}+\left(\sum_{j=1}^m |y_j-z_j|^{p}\right)^{1/p}, \] which is exactly A3.
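As a sanity check (not a proof), the Minkowski inequality can also be verified numerically on random nonnegative vectors; the vector length and the grid of \(p\) values below are arbitrary choices:

```r
set.seed(42)
a <- runif(100) # nonnegative a_1, ..., a_m
b <- runif(100) # nonnegative b_1, ..., b_m
for (p in c(1, 1.5, 2, 5, 10)) {
  lhs <- sum((a + b)^p)^(1 / p)
  rhs <- sum(a^p)^(1 / p) + sum(b^p)^(1 / p)
  stopifnot(lhs <= rhs) # Minkowski inequality holds
}
```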
To illustrate the Minkowski inequality, draw \(100\) times two lists of \(100\) draws from the lognormal distribution with mean \(1600\) and standard deviation \(300\). Illustrate with a graph the gap between the two drawn lists.
# Cauchy-Schwarz inequality
The Pearson correlation coefficient is a similarity measure on \(\mathbb{R}^m\) defined by: \[ \rho(\mathbf{x},\mathbf{y})= \frac{\sum_{j=1}^m (x_j-\bar{\mathbf{x}})(y_j-\bar{\mathbf{y}})}{{\sqrt{\sum_{j=1}^m (x_j-\bar{\mathbf{x}})^2\sum_{j=1}^m (y_j-\bar{\mathbf{y}})^2}}}, \] where \(\bar{\mathbf{x}}\) is the mean of the vector \(\mathbf{x}\) defined by: \[\bar{\mathbf{x}}=\frac{1}{m}\sum_{j=1}^m x_j.\]
Note that the Pearson correlation coefficient is symmetric and invariant to any positive linear transformation, i.e.: \[\rho(\alpha\mathbf{x}+\beta,\mathbf{y})=\rho(\mathbf{x},\mathbf{y}),\] for any \(\alpha>0\) and any real \(\beta\).
The Pearson distance (or correlation distance) is defined by: \[ d(\mathbf{x},\mathbf{y})=1-\rho(\mathbf{x},\mathbf{y}). \]
Note that the Pearson distance does not satisfy A1: \(d(\mathbf{x},\mathbf{y})=0\) whenever \(\mathbf{y}=\alpha\mathbf{x}\) with \(\alpha>0\), even when \(\mathbf{x}\neq\mathbf{y}\). It does not satisfy the triangle inequality either. However, the symmetry property A2 is fulfilled.
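Both properties are easy to check numerically; `d_pearson()` below is a small helper of ours, not a base R function:

```r
d_pearson <- function(x, y) 1 - cor(x, y) # Pearson (correlation) distance
x <- c(3, 1, 4, 15, 92)
y <- c(30, 2, 9, 20, 48)
# Invariance to positive linear transformations: same distance
all.equal(d_pearson(2 * x + 5, y), d_pearson(x, y))
## [1] TRUE
# A1 fails: zero distance between two different vectors
all.equal(d_pearson(x, 3 * x), 0)
## [1] TRUE
```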
x=c(3, 1, 4, 15, 92)
rank(x)
## [1] 2 1 3 4 5
y=c(30,2 , 9, 20, 48)
rank(y)
## [1] 4 1 2 3 5
d=rank(x)-rank(y)
d
## [1] -2 0 1 1 0
cor(rank(x),rank(y))
## [1] 0.7
1-6*sum(d^2)/(5*(5^2-1))
## [1] 0.7
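The same value is returned directly by `cor()` with `method = "spearman"`, which computes the correlation of the ranks:

```r
x=c(3, 1, 4, 15, 92)
y=c(30,2 , 9, 20, 48)
cor(x, y, method = "spearman") # same as cor(rank(x), rank(y))
## [1] 0.7
```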
x=c(3, 1, 4, 15, 92)
y=c(30,2 , 9, 20, 48)
tau=0
for (i in 1:5)
{
tau=tau+sign(x -x[i])%*%sign(y -y[i])
}
tau=tau/(5*4)
tau
## [,1]
## [1,] 0.6
cor(x,y, method="kendall")
## [1] 0.6
v=c(3, 1, 4, 15, 92)
w=c(30,2 , 9, 20, 48)
(v-mean(v))/sd(v)
## [1] -0.5134116 -0.5647527 -0.4877410 -0.2053646 1.7712699
scale(v)
## [,1]
## [1,] -0.5134116
## [2,] -0.5647527
## [3,] -0.4877410
## [4,] -0.2053646
## [5,] 1.7712699
## attr(,"scaled:center")
## [1] 23
## attr(,"scaled:scale")
## [1] 38.9551
(w-mean(w))/sd(w)
## [1] 0.45263128 -1.09293895 -0.70654639 -0.09935809 1.44621214
scale(w)
## [,1]
## [1,] 0.45263128
## [2,] -1.09293895
## [3,] -0.70654639
## [4,] -0.09935809
## [5,] 1.44621214
## attr(,"scaled:center")
## [1] 21.8
## attr(,"scaled:scale")
## [1] 18.11629
Consider the following example
Plot the data using a nice scatter plot.
Transform the Height from centimeters (cm) into feet (ft).
Display your data in a table.
Plot the data within a new scatter plot.
What do you observe?
Standardize the two variables Age and Height.
Display your data in a table.
Plot the standardized data within a new scatter plot.
Conclude.
A common simple situation occurs when all the information consists of the presence/absence of \(m\) two-level qualitative characters.
The presence of a character is coded by \(1\) and its absence by \(0\).
We have at our disposal two vectors:
\(\mathbf{x}\) is observed for a first individual (or object);
\(\mathbf{y}\) is observed for a second individual.
We can then calculate the following four statistics:
\(a=\mathbf{x\cdot y}=\sum_{j=1}^mx_jy_j.\)
\(b=\mathbf{x\cdot (1-y)}=\sum_{j=1}^mx_j(1-y_j).\)
\(c=\mathbf{(1-x)\cdot y}=\sum_{j=1}^m(1-x_j)y_j.\)
\(d=\mathbf{(1-x)\cdot (1-y)}=\sum_{j=1}^m(1-x_j)(1-y_j).\)
The counts of matches are \(a\) for \((1,1)\) and \(d\) for \((0,0)\);
the counts of mismatches are \(b\) for \((1,0)\) and \(c\) for \((0,1)\).
Note that obviously \(a+b+c+d= m\).
This gives a very useful \(2 \times 2\) association table.

|  | Second individual: 1 | Second individual: 0 | Totals |
|---|---|---|---|
| First individual: 1 | \(a\) | \(b\) | \(a+b\) |
| First individual: 0 | \(c\) | \(d\) | \(c+d\) |
| Totals | \(a+c\) | \(b+d\) | \(m\) |
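These four counts are simple dot products. A sketch in R, using the rows for Ilan and Talia from the data set introduced below:

```r
x <- c(1, 0, 1, 1, 0, 0, 1, 0, 0, 0) # Ilan
y <- c(0, 0, 0, 1, 0, 1, 0, 0, 0, 0) # Talia
a <- sum(x * y)             # matches (1,1)
b <- sum(x * (1 - y))       # mismatches (1,0)
c <- sum((1 - x) * y)       # mismatches (0,1)
d <- sum((1 - x) * (1 - y)) # matches (0,0); a+b+c+d = 10 characters
c(a = a, b = b, c = c, d = d)
## a b c d
## 1 3 1 5
```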
data=c(
1,0,1,1,0,0,1,0,0,0,
0,1,0,0,1,0,0,0,0,0,
0,0,1,0,0,0,1,0,0,1,
0,1,0,0,0,0,0,1,1,0,
1,1,0,0,1,1,0,1,1,0,
1,1,0,0,1,0,1,1,0,0,
0,0,0,1,0,1,0,0,0,0,
0,0,0,1,0,1,0,0,0,0
)
data=data.frame(matrix(data, nrow=8,byrow=T))
row.names(data)=c("Ilan","Jacqueline","Kim","Lieve","Leon","Peter","Talia","Tina")
names(data)=c("Sex", "Married", "Fair Hair", "Blue Eyes", "Wears Glasses", "Round Face", "Pessimist", "Evening Type", "Is an Only Child", "Left-Handed")
library(knitr)
library(xtable)
library(stargazer)
library(texreg)
library(kableExtra)
library(summarytools)
set.seed(893)
datat<-as.data.frame(t(data))
datat=lapply(datat,as.factor)
Ilan=datat$Ilan
Talia =datat$Talia
print(ctable(Ilan,Talia,prop = 'n',style = "rmarkdown"))
| Ilan \ Talia | 0 | 1 | Total |
|---|---|---|---|
| 0 | 5 | 1 | 6 |
| 1 | 3 | 1 | 4 |
| Total | 8 | 2 | 10 |
| Coefficient | \(s(\mathbf{x},\mathbf{y})\) | \(d(\mathbf{x},\mathbf{y})=1-s(\mathbf{x},\mathbf{y})\) |
|---|---|---|
| Simple matching | \(\frac{a+d}{a+b+c+d}\) | \(\frac{b+c}{a+b+c+d}\) |
| Jaccard | \(\frac{a}{a+b+c}\) | \(\frac{b+c}{a+b+c}\) |
| Rogers and Tanimoto (1960) | \(\frac{a+d}{a+2(b+c)+d}\) | \(\frac{2(b+c)}{a+2(b+c)+d}\) |
| Gower and Legendre (1986) | \(\frac{2(a+d)}{2(a+d)+b+c}\) | \(\frac{b+c}{2(a+d)+b+c}\) |
| Gower and Legendre (1986) | \(\frac{2a}{2a+b+c}\) | \(\frac{b+c}{2a+b+c}\) |
To calculate these coefficients, we use the function dist.binary(), available in the ade4 package.
All the distances in the ade4 package are of the form \(d(\mathbf{x},\mathbf{y})= \sqrt{1 - s(\mathbf{x},\mathbf{y})}\).
library(ade4)
a=1
b=3
c=1
d=5
dist.binary(data[c("Ilan","Talia"),],method=2)^2
##       Ilan
## Talia 0.4
1-(a+d)/(a+b+c+d)
## [1] 0.4
dist.binary(data[c("Ilan","Talia"),],method=1)^2
##       Ilan
## Talia 0.8
1-a/(a+b+c)
## [1] 0.8
dist.binary(data[c("Ilan","Talia"),],method=4)^2
##       Ilan
## Talia 0.5714286
1-(a+d)/(a+2*(b+c)+d)
## [1] 0.5714286
# One Gower coefficient is missing
dist.binary(data[c("Ilan","Talia"),],method=5)^2
##       Ilan
## Talia 0.6666667
1-2*a/(2*a+b+c)
## [1] 0.6666667
From Gan et al.:
Gower’s coefficient is a dissimilarity measure specifically designed to handle mixed attribute types or variables.
See: Gower, John C. A general coefficient of similarity and some of its properties. Biometrics, 1971, p. 857-871.
The coefficient is calculated as a weighted average of attribute contributions.
The weights are usually used only to indicate which attribute values can actually be compared meaningfully.
The formula is: \[ d(\mathbf{x},\mathbf{y})=\frac{\sum_{j=1}^m w_j \delta(x_j,y_j)}{\sum_{j=1}^m w_j}. \]
The weight \(w_j\) is set equal to \(1\) when both measurements \(x_j\) and \(y_j\) are nonmissing, and to \(0\) otherwise.
The number \(\delta(x_j,y_j)\) is the contribution of the \(j\)th measure or variable to the dissimilarity measure.
If the \(j\)th measure is nominal, we take
\[
\delta(x_j,y_j)\equiv \begin{cases}0, &\text{ if } x_j=y_j;\\ 1, &\text{ if } x_j \neq y_j.\end{cases}
\]
If the \(j\)th measure is interval-scaled, we take instead: \[ \delta(x_j,y_j)\equiv \frac{|x_j-y_j|}{R_j}, \] where \(R_j\) is the range of variable \(j\) over the available data.
Consider the following data set:
library(cluster)
library(dplyr)
data <-flower %>%
rename(Winters=V1,Shadow=V2,Tubers=V3,Color=V4,Soil=V5,Preference=V6,Height=V7,Distance=V8) %>%
mutate(Winters=recode(Winters,"1"="Yes","0"="No"),
Shadow=recode(Shadow,"1"="Yes","0"="No"),
Tubers=recode(Tubers,"1"="Yes","0"="No"),
Color=recode(Color,"1"="white", "2"="yellow", "3"= "pink", "4"="red", "5"="blue"),
Soil=recode(Soil,"1"="dry", "2"="normal", "3"= "wet")
)
row.names(data)=c("Begonia","Broom","Camellia","Dahlia","Forget-me-not","Fuchsia",
"Geranium", "Gladiolus","Heather","Hydrangea","Iris","Lily","Lily-of-the-valley",
"Peony","Pink Carnation","Red Rose","Scotch Rose","Tulip")
res=lapply(data,class)
res=as.data.frame(res)
res[1,] %>%
knitr::kable()
| Winters | Shadow | Tubers | Color | Soil | Preference | Height | Distance |
|---|---|---|---|---|---|---|---|
| factor | factor | factor | factor | ordered | ordered | numeric | numeric |
flower[1:2,]
## V1 V2 V3 V4 V5 V6 V7 V8
## 1 0 1 1 4 3 15 25 15
## 2 1 0 0 2 1 3 150 50
max(data$Height)-min(data$Height)
## [1] 180
max(data$Distance)-min(data$Distance)
## [1] 50
\[ \frac{|1-0|+|0-1|+|0-1|+1+|1-3|/2+|3-15|/17+|150-25|/180+|50-15|/50}{8}\approx 0.8875408 \]
Cluster package description available at this link.
library(cluster)
(abs(1-0)+abs(0-1)+abs(0-1)+1+abs(1-3)/2+abs(3-15)/17+abs(150-25)/180+abs(50-15)/50)/8
## [1] 0.8875408
dist<-daisy(data[,1:8],metric = "gower")
as.matrix(dist)[1:2,1:2]
## Begonia Broom
## Begonia 0.0000000 0.8875408
## Broom 0.8875408 0.0000000
stargazer(USArrests,header=TRUE, type='html',summary=FALSE,digits=1)
| Murder | Assault | UrbanPop | Rape | |
| Alabama | 13.2 | 236 | 58 | 21.2 |
| Alaska | 10 | 263 | 48 | 44.5 |
| Arizona | 8.1 | 294 | 80 | 31 |
| Arkansas | 8.8 | 190 | 50 | 19.5 |
| California | 9 | 276 | 91 | 40.6 |
| Colorado | 7.9 | 204 | 78 | 38.7 |
| Connecticut | 3.3 | 110 | 77 | 11.1 |
| Delaware | 5.9 | 238 | 72 | 15.8 |
| Florida | 15.4 | 335 | 80 | 31.9 |
| Georgia | 17.4 | 211 | 60 | 25.8 |
| Hawaii | 5.3 | 46 | 83 | 20.2 |
| Idaho | 2.6 | 120 | 54 | 14.2 |
| Illinois | 10.4 | 249 | 83 | 24 |
| Indiana | 7.2 | 113 | 65 | 21 |
| Iowa | 2.2 | 56 | 57 | 11.3 |
| Kansas | 6 | 115 | 66 | 18 |
| Kentucky | 9.7 | 109 | 52 | 16.3 |
| Louisiana | 15.4 | 249 | 66 | 22.2 |
| Maine | 2.1 | 83 | 51 | 7.8 |
| Maryland | 11.3 | 300 | 67 | 27.8 |
| Massachusetts | 4.4 | 149 | 85 | 16.3 |
| Michigan | 12.1 | 255 | 74 | 35.1 |
| Minnesota | 2.7 | 72 | 66 | 14.9 |
| Mississippi | 16.1 | 259 | 44 | 17.1 |
| Missouri | 9 | 178 | 70 | 28.2 |
| Montana | 6 | 109 | 53 | 16.4 |
| Nebraska | 4.3 | 102 | 62 | 16.5 |
| Nevada | 12.2 | 252 | 81 | 46 |
| New Hampshire | 2.1 | 57 | 56 | 9.5 |
| New Jersey | 7.4 | 159 | 89 | 18.8 |
| New Mexico | 11.4 | 285 | 70 | 32.1 |
| New York | 11.1 | 254 | 86 | 26.1 |
| North Carolina | 13 | 337 | 45 | 16.1 |
| North Dakota | 0.8 | 45 | 44 | 7.3 |
| Ohio | 7.3 | 120 | 75 | 21.4 |
| Oklahoma | 6.6 | 151 | 68 | 20 |
| Oregon | 4.9 | 159 | 67 | 29.3 |
| Pennsylvania | 6.3 | 106 | 72 | 14.9 |
| Rhode Island | 3.4 | 174 | 87 | 8.3 |
| South Carolina | 14.4 | 279 | 48 | 22.5 |
| South Dakota | 3.8 | 86 | 45 | 12.8 |
| Tennessee | 13.2 | 188 | 59 | 26.9 |
| Texas | 12.7 | 201 | 80 | 25.5 |
| Utah | 3.2 | 120 | 80 | 22.9 |
| Vermont | 2.2 | 48 | 32 | 11.2 |
| Virginia | 8.5 | 156 | 63 | 20.7 |
| Washington | 4 | 145 | 73 | 26.2 |
| West Virginia | 5.7 | 81 | 39 | 9.3 |
| Wisconsin | 2.6 | 53 | 66 | 10.8 |
| Wyoming | 6.8 | 161 | 60 | 15.6 |
set.seed(123)
ss <- sample(1:50,15)
df <- USArrests[ss, ]
df.scaled <- scale(df)
stargazer(df.scaled,header=TRUE, type='html',summary=FALSE,digits=1)
| Murder | Assault | UrbanPop | Rape | |
| New Mexico | 0.6 | 1.0 | 0.2 | 0.6 |
| Iowa | -1.7 | -1.5 | -0.7 | -1.4 |
| Indiana | -0.5 | -0.9 | -0.1 | -0.5 |
| Arizona | -0.2 | 1.1 | 0.9 | 0.5 |
| Tennessee | 1.0 | -0.1 | -0.5 | 0.1 |
| Texas | 0.9 | 0.1 | 0.9 | -0.04 |
| Oregon | -1.0 | -0.4 | 0.01 | 0.3 |
| West Virginia | -0.8 | -1.3 | -2.0 | -1.6 |
| Missouri | -0.01 | -0.2 | 0.2 | 0.2 |
| Montana | -0.8 | -1.0 | -1.0 | -0.9 |
| Nebraska | -1.2 | -1.0 | -0.3 | -0.9 |
| California | -0.01 | 0.9 | 1.7 | 1.4 |
| South Carolina | 1.3 | 1.0 | -1.3 | -0.3 |
| Nevada | 0.8 | 0.7 | 1.0 | 2.0 |
| Florida | 1.6 | 1.6 | 0.9 | 0.6 |
Remark: All these functions compute distances between rows of the data.
Remark: If we want to compute pairwise distances between variables, we must transpose the data to have variables in the rows.
We first compute Euclidean distances.
dist.eucl <- dist(df.scaled, method = "euclidean",upper = TRUE)
stargazer(as.data.frame(as.matrix(dist.eucl)[1:3, 1:3]),header=TRUE, type='html',summary=FALSE,digits=1)
| New Mexico | Iowa | Indiana | |
| New Mexico | 0 | 4.1 | 2.5 |
| Iowa | 4.1 | 0 | 1.8 |
| Indiana | 2.5 | 1.8 | 0 |
round(sqrt(sum((df.scaled["New Mexico",]-df.scaled["Iowa",])^2)),1)
## [1] 4.1
round(sqrt(sum((df.scaled["New Mexico",]-df.scaled["Indiana",])^2)),1)
## [1] 2.5
round(sqrt(sum((df.scaled["Iowa",]-df.scaled["Indiana",])^2)),1)
## [1] 1.8
library("factoextra")
dist.cor <- get_dist(df.scaled, method = "pearson")
round(as.matrix(dist.cor)[1:3, 1:3], 1)
## New Mexico Iowa Indiana
## New Mexico 0.0 1.7 2.0
## Iowa 1.7 0.0 0.3
## Indiana 2.0 0.3 0.0
round(1-cor(df.scaled["New Mexico",],df.scaled["Iowa",]),1)
## [1] 1.7
round(1-cor(df.scaled["New Mexico",],df.scaled["Indiana",]),1)
## [1] 2
round(1-cor(df.scaled["Iowa",],df.scaled["Indiana",]),1)
## [1] 0.3
library(factoextra)
fviz_dist(dist.eucl)
Data
library(multicool)
## Loading required package: Rcpp
Stirling2(8,3)
## [1] 966
Stirling2(8,1)
## [1] 1
Stirling2(3,2)
## [1] 3
library(gmp)
Stirling2(5,3)
## Big Integer ('bigz') :
## [1] 25
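These values can be checked against the standard recurrence \(S(n,k)=k\,S(n-1,k)+S(n-1,k-1)\); a minimal recursive implementation (exact only while the numbers stay small):

```r
stirling2 <- function(n, k) {
  if (k == 0) return(as.numeric(n == 0)) # S(0,0) = 1, S(n,0) = 0 for n > 0
  if (k > n)  return(0)
  if (k == 1 || k == n) return(1)
  k * stirling2(n - 1, k) + stirling2(n - 1, k - 1) # recurrence
}
stirling2(8, 3)
## [1] 966
stirling2(5, 3)
## [1] 25
```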
A plausible neighborhood for a partition is the set of partitions obtained by transferring an object from one cluster to another.
For the partition (BB BR) (BS) (BC BH), the neighborhood consists of the following ten partitions:
Let \(\mathbf{x}_i\equiv (x_i^1,\cdots,x_i^m)\) be the vector of values for object \(i\), \(i=1,\cdots ,n.\)
The variables are assumed scaled.
The partition has \(k\) disjoint clusters \(C_1,\cdots,C_k\), where each \(C_l\) is the set of indices of the objects in cluster \(l\).
Let \(n_l\) be the number of objects in cluster \(C_l\).
Each of the \(n\) objects lies in just one of the \(k\) clusters.
Note that \(\sum_{l=1}^k n_l=n\).
The vector of means over the objects in cluster \(C_l\) is denoted by \(\bar{\mathbf{x}}_{l}\), with \[ \bar{\mathbf{x}}_{l}\equiv\frac{1}{n_l}\sum_{i\in C_l}\mathbf{x}_{i}=(\bar{x}_l^1,\cdots,\bar{x}_l^m),\:l=1,\cdots, k, \] where \[ \bar{x}_l^j\equiv \frac{\sum_{i\in C_l}x_i^{j}}{n_l},\:j=1,\cdots, m; \:l=1,\cdots,k. \]
The distance between object \(i\) and cluster \(l\) is \(d(\mathbf{x}_i,\bar{\mathbf{x}}_l)\), where \(d\) is taken to be the Euclidean distance \[ d(\mathbf{x}_i,\bar{\mathbf{x}}_l)=||\mathbf{x}_i-\bar{\mathbf{x}}_l||=\bigg[\sum_{j=1}^m(x_i^j-\bar{x}_l^j)^2\bigg]^{1/2},\:i=1,\cdots,n;\:l=1,\cdots,k, \] where \(||\mathbf{\cdot}||\) is the Euclidean norm.
The error of the partition is taken to be
\[ e= \sum_{l=1}^{k}\sum_{i\in C_l} ||\mathbf{x}_i-\bar{\mathbf{x}}_l||^2=\sum_{i=1}^{n}||\mathbf{x}_i-\bar{\mathbf{x}}_{l(i)}||^2, \]
where \(l(i)\) is the index of the cluster of object \(i\).
When object \(1\) is transferred from its current cluster \(C_{l(1)}\) to another cluster \(C_l\), the error varies by: \[ \Delta e= \frac{n_ld^2(\mathbf{x}_1,\bar{\mathbf{x}}_l)}{n_l+1}-\frac{n_{l(1)}d^2(\mathbf{x}_1,\bar{\mathbf{x}}_{l(1)})}{n_{l(1)}-1},\:l=1,\cdots, k,\:l\neq l(1). \]
Prove that the error variation is indeed given by:
\[ \Delta e= \frac{n_ld^2(\mathbf{x}_1,\bar{\mathbf{x}}_l)}{n_l+1}-\frac{n_{l(1)}d^2(\mathbf{x}_1,\bar{\mathbf{x}}_{l(1)})}{n_{l(1)}-1},\:l=1,\cdots, k,\:l\neq l(1). \]
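Before proving it, the formula can be checked numerically on hypothetical data (two made-up clusters of five objects; `err()` and `d2()` are helpers of ours):

```r
set.seed(1)
X  <- matrix(rnorm(20), nrow = 10) # 10 objects, 2 variables
C1 <- 1:5                          # cluster of object 1, n_{l(1)} = 5
C2 <- 6:10                         # receiving cluster, n_l = 5
err <- function(idx) sum(scale(X[idx, , drop = FALSE], scale = FALSE)^2) # within SS
d2  <- function(x, idx) sum((x - colMeans(X[idx, , drop = FALSE]))^2)   # squared distance to mean
e_before <- err(C1) + err(C2)
e_after  <- err(C1[-1]) + err(c(C2, 1)) # object 1 transferred to C2
delta    <- 5 * d2(X[1, ], C2) / (5 + 1) - 5 * d2(X[1, ], C1) / (5 - 1)
all.equal(e_after - e_before, delta)
## [1] TRUE
```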
# library("cluster.datasets")
# write.csv(rda.meat.fish.fowl.1959, "Hartigandata1.csv")
df<-read.csv("Hartigandata1.csv")
print(df)
## X name energy protein fat calcium iron
## 1 1 Braised beef 11 29 28 1 26
## 2 2 Hamburger 8 30 17 1 27
## 3 3 Roast beef 18 21 39 1 20
## 4 4 Beefsteak 12 27 32 1 26
## 5 5 Canned beef 6 31 10 2 37
## 6 6 Broiled chicken 8 29 3 1 14
## 7 7 Canned chicken 5 36 7 2 15
## 8 8 Beef heart 5 37 5 2 59
## 9 9 Roast lamb leg 8 29 20 1 26
## 10 10 Roast lamb shoulder 9 26 25 1 23
## 11 11 Smoked ham 11 29 28 1 25
## 12 12 Pork roast 11 27 29 1 25
## 13 13 Pork simmered 11 27 30 1 24
## 14 14 Beef tongue 6 26 14 1 25
## 15 15 Veal cutlet 6 33 9 1 27
## 16 16 Baked bluefish 4 31 4 3 6
## 17 17 Raw clams 2 16 1 10 60
## 18 18 Canned clams 1 10 1 9 54
## 19 19 Canned crabmeat 3 20 2 5 8
## 20 20 Fried haddock 4 23 5 2 5
## 21 21 Broiled mackerel 6 27 13 1 10
## 22 22 Canned mackerel 5 23 9 20 18
## 23 23 Fried perch 6 23 11 2 13
## 24 24 Canned salmon 4 24 5 20 7
## 25 25 Canned sardines 6 31 9 46 25
## 26 26 Canned tuna 5 36 7 1 12
## 27 27 Canned shrimp 3 33 1 12 26
df<-df[1:8,c(3,4,6)]
df
## energy protein calcium
## 1 11 29 1
## 2 8 30 1
## 3 18 21 1
## 4 12 27 1
## 5 6 31 2
## 6 8 29 1
## 7 5 36 2
## 8 5 37 2
# The data set contains some errors
df[3,1]<-13 # error in row 3
df[6,1]<-4  # error in row 6
df[7,3]<-1  # error in row 7
df
## energy protein calcium
## 1 11 29 1
## 2 8 30 1
## 3 13 21 1
## 4 12 27 1
## 5 6 31 2
## 6 4 29 1
## 7 5 36 1
## 8 5 37 2
rownames(df)<-c("BB","HR","BR","BS","BC","CB","CC","BH")
df
## energy protein calcium
## BB 11 29 1
## HR 8 30 1
## BR 13 21 1
## BS 12 27 1
## BC 6 31 2
## CB 4 29 1
## CC 5 36 1
## BH 5 37 2
colnames(df)<-c("Energy","Protein","Calcium")
df
## Energy Protein Calcium
## BB 11 29 1
## HR 8 30 1
## BR 13 21 1
## BS 12 27 1
## BC 6 31 2
## CB 4 29 1
## CC 5 36 1
## BH 5 37 2
km.res<-kmeans(df[1:8,],3,iter.max = 100)
km.res$cluster
## BB HR BR BS BC CB CC BH
## 3 3 2 3 1 1 1 1
km.res$centers
## Energy Protein Calcium
## 1 5.00000 33.25000 1.5
## 2 13.00000 21.00000 1.0
## 3 10.33333 28.66667 1.0
km.res$totss
## [1] 267.5
sum((df[1:8,]$Energy-mean(df[1:8,]$Energy))^2)+
sum((df[1:8,]$Protein-mean(df[1:8,]$Protein))^2)+
sum((df[1:8,]$Calcium-mean(df[1:8,]$Calcium))^2)
## [1] 267.5
7*var(df[1:8,]$Energy)+7*var(df[1:8,]$Protein)+7*var(df[1:8,]$Calcium)
## [1] 267.5
km.res$withinss
## [1] 47.75000 0.00000 13.33333
km.res$tot.withinss
## [1] 61.08333
km.res$betweenss
## [1] 206.4167
km.res$size
## [1] 4 1 3
km.res$iter
## [1] 1
Remark: The location of a knee (elbow) in the curve of the total within-cluster sum of squares against the number of clusters is generally taken to indicate an appropriate number of clusters.
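A minimal elbow plot for the food data, rebuilt here so the chunk is self-contained (the range \(k=1,\cdots,7\) is an arbitrary choice; `nstart = 25` restarts kmeans to avoid poor local minima):

```r
df <- data.frame(
  Energy  = c(11, 8, 13, 12, 6, 4, 5, 5),
  Protein = c(29, 30, 21, 27, 31, 29, 36, 37),
  Calcium = c(1, 1, 1, 1, 2, 1, 1, 2),
  row.names = c("BB", "HR", "BR", "BS", "BC", "CB", "CC", "BH")
)
# Total within-cluster sum of squares for each k; wss[1] equals totss = 267.5
wss <- sapply(1:7, function(k) kmeans(df, centers = k, nstart = 25)$tot.withinss)
plot(1:7, wss, type = "b", xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```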
library(ggplot2)
library(factoextra)
fviz_cluster(km.res, df, ellipse.type = "norm")
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
fviz_cluster(km.res, data = df[1:8,],
palette = c("#2E9FDF", "#00AFBB", "#E7B800"),
ellipse.type = "euclid", # Concentration ellipse
star.plot = TRUE, # Add segments from centroids to items
repel = TRUE, # Avoid label overplotting (slow)
ggtheme = theme_minimal()
)
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
pca=prcomp(df[1:8,], scale = TRUE)
summary(pca)
## Importance of components:
## PC1 PC2 PC3
## Standard deviation 1.4655 0.7938 0.47119
## Proportion of Variance 0.7159 0.2100 0.07401
## Cumulative Proportion 0.7159 0.9260 1.00000
data(decathlon2)
decathlon.active <- decathlon2[1:23, 1:10]
res.pca <- prcomp(decathlon.active, scale = TRUE)
summary(res.pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0308 1.3559 1.1132 0.90523 0.83759 0.65029 0.55007
## Proportion of Variance 0.4124 0.1839 0.1239 0.08194 0.07016 0.04229 0.03026
## Cumulative Proportion 0.4124 0.5963 0.7202 0.80213 0.87229 0.91458 0.94483
## PC8 PC9 PC10
## Standard deviation 0.52390 0.39398 0.3492
## Proportion of Variance 0.02745 0.01552 0.0122
## Cumulative Proportion 0.97228 0.98780 1.0000
fviz_pca_biplot(res.pca)
Assume the data are clustered into \(k\) clusters.
For \(i=1,\cdots,n\), let: \[ a_i=\frac{1}{n_{l(i)}-1}\sum_{i^\prime \in C_{l(i)}\setminus\{i\}}d(i,i^\prime), \] be the mean distance between \(i\) and all the other points of its cluster.
If \(n_{l(i)}=1\), we set \(a_i=0\).
It is interpreted as a measure of how well \(i\) is assigned to its cluster (the smaller the value, the better the assignment).
Let also \[ b_i=\min_{l\neq l(i)}{\frac {1}{n_{l}}}\sum_{i^\prime\in C_l}d(i,i^\prime), \] be the smallest mean distance from \(i\) to the points of any other cluster; the cluster achieving this minimum is called the “neighboring cluster” of \(i\).
We now define a silhouette of \(i\) as \[ s_i=\frac {b_i-a_i}{\max(a_i,b_i)} \text{ if }n_{l(i)}>1, \] and \[ s_i=0\text{ if }n_{l(i)}=1. \]
Alternatively, we have \[ s_i= \begin{cases} 1-a_i/b_i,&\:\text{if } a_i<b_i;\\ 0,&\:\text{if } a_i=b_i;\\ b_i/a_i-1,&\:\text{if } a_i>b_i. \end{cases} \]
From the above definition it is clear that \[ -1\leq s_i\leq 1. \]
For \(s_i\) to be close to \(1\), we require \(a_i\ll b_i\).
If \(s_i\) is close to \(-1\), then \(i\) would be more appropriately assigned to its neighboring cluster.
An \(s_i\) near zero means that \(i\) lies on the border between two natural clusters.
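As a sanity check, \(a_i\), \(b_i\), and \(s_i\) can be computed by hand for a single observation. The sketch below rebuilds the food data and uses a fixed cluster assignment corresponding to the table below (the helper code is ours):

```r
df <- data.frame(
  Energy  = c(11, 8, 13, 12, 6, 4, 5, 5),
  Protein = c(29, 30, 21, 27, 31, 29, 36, 37),
  Calcium = c(1, 1, 1, 1, 2, 1, 1, 2),
  row.names = c("BB", "HR", "BR", "BS", "BC", "CB", "CC", "BH")
)
cl <- c(BB = 2, HR = 1, BR = 2, BS = 2, BC = 1, CB = 1, CC = 3, BH = 3)
D  <- as.matrix(dist(df))  # pairwise Euclidean distances
i  <- 1                    # observation BB
own   <- setdiff(which(cl == cl[i]), i)          # other members of BB's cluster
a_i   <- mean(D[i, own])                         # mean within-cluster distance
other <- setdiff(unique(cl), cl[i])
b_i   <- min(sapply(other, function(l) mean(D[i, cl == l]))) # neighboring cluster
s_i   <- (b_i - a_i) / max(a_i, b_i)
round(c(a_i = a_i, b_i = b_i, s_i = s_i), 4)
##    a_i    b_i    s_i
## 5.2411 5.2132 -0.0053
```

The value \(s_{BB}\approx -0.0053\) matches the silhouette width reported for BB in the table below.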
library(knitr)
knitr::opts_chunk$set(echo = FALSE)
km.res<-kmeans(df[1:8,],3,iter.max = 100)
silhouette(km.res$cluster,dist(df[1:8,]))[,]%>%
data.frame()->slobj
mutate(slobj,Food=row.names(df[1:8,]))%>%
relocate(Food)->slobj
slobj%>%
arrange(cluster)->slobj
knitr::kable(slobj, caption = "Silhouettes for Food data",digits = 4,col.names = c("Food","Cluster","Neighbor","Silhouette width"),align=c("l","c","c","c"),booktabs = TRUE)%>%
kable_classic(latex_options = "hold_position")
| Food | Cluster | Neighbor | Silhouette width |
|---|---|---|---|
| HR | 1 | 2 | 0.4659 |
| BC | 1 | 3 | 0.5168 |
| CB | 1 | 3 | 0.5312 |
| BB | 2 | 1 | -0.0053 |
| BR | 2 | 1 | 0.3785 |
| BS | 2 | 1 | 0.3921 |
| CC | 3 | 1 | 0.7764 |
| BH | 3 | 1 | 0.8062 |
The author obtains the following two figures.